Introduction

This Rmarkdown script (together with the corresponding tab-separated input data file InfoRateData.csv and the resulting HTML document) contains the full analysis and plotting code accompanying the paper Human languages share an optimal information transmission rate.

The data

For more information on the data, please see Oh (2015). There are 17 languages in total (see the Table below).

The Oral Corpus

The oral corpus is based on a subset of the Multext (Multilingual Text Tools and Corpora) parallel corpus (Campione & Véronis, 1998) in British English, German, and Italian. The material consists of 15 short texts of 3-5 semantically connected sentences, carefully translated by a native speaker of each language.

For the other 14 languages, two of the authors supervised the translation and recording of new datasets. All participants were native speakers of the target language, with a focus on a specific variety of the language when possible (e.g. Mandarin spoken in Beijing, Serbian in Belgrade, and Korean in Seoul). No strict control of age or of the speakers' social background was imposed, but speakers were mainly students or members of academic institutions. Speakers were asked to read each text three times (first silently, then aloud twice). The texts were presented one by one on the screen, in random order, in a self-paced reading paradigm. In this way, speakers familiarized themselves with the texts and reduced their reading errors. The second loud recording was the one analyzed in this study.

Text Corpus

Text datasets were acquired from various sources as illustrated in the Table below. After an initial data curation, each dataset was phonetically transcribed and automatically syllabified by a rule-based program written by one of the authors, except in the following cases:

  1. when syllabification was already provided with the dataset (English, French, German, and Vietnamese for the multisyllabic words);
  2. when the corpus was syllabified by an automatic grapheme-to-phoneme converter (Catalan, Spanish, and Thai).

Additionally, no syllabification was required for the Sino-Tibetan languages (Cantonese and Mandarin Chinese), since each character corresponds to one syllable.

| Language | Family | ISO 639-3 | Corpus |
|----------|--------|-----------|--------|
| Basque | Basque | EUS | E-Hitz (Perea et al., 2006) |
| British English | Indo-European | ENG | WebCelex (MPI for Psycholinguistics) |
| Cantonese | Sino-Tibetan | YUE | A linguistic corpus of mid-20th century Hong Kong Cantonese |
| Catalan | Indo-European | CAT | Frequency dictionary (Zséder et al., 2012) |
| Finnish | Uralic | FIN | Finnish Parole Corpus |
| French | Indo-European | FRA | Lexique 3.80 (New et al., 2001) |
| German | Indo-European | DEU | WebCelex (MPI for Psycholinguistics) |
| Hungarian | Uralic | HUN | Hungarian National Corpus (Váradi, 2002) |
| Italian | Indo-European | ITA | The Corpus PAISÀ (Lyding et al., 2014) |
| Japanese | Japanese | JPN | Japanese Internet Corpus (Sharoff, 2006) |
| Korean | Korean | KOR | Leipzig Corpora Collection (LCC) |
| Mandarin Chinese | Sino-Tibetan | CMN | Chinese Internet Corpus (Sharoff, 2006) |
| Serbian | Indo-European | SRP | Frequency dictionary (Zséder et al., 2012) |
| Spanish | Indo-European | SPA | Frequency dictionary (Zséder et al., 2012) |
| Thai | Tai-Kadai | THA | Thai National Corpus (TNC) |
| Turkish | Turkic | TUR | Leipzig Corpora Collection (LCC) |
| Vietnamese | Austroasiatic | VIE | VNSpeechCorpus (Le et al., 2004) |

Dataset structure

The data is structured as follows:

  • Language is the unique language ISO 639-3 ID with 17 possible values “CAT”, “CMN”, “DEU”, “ENG”, “EUS”, “FIN”, “FRA”, “HUN”, “ITA”, “JPN”, “KOR”, “SPA”, “SRP”, “THA”, “TUR”, “VIE”, “YUE”;
  • Language family is the language family with 9 possible values “Austroasiatic”, “Basque”, “Indo-European”, “Japanese”, “Korean”, “Sino-Tibetan”, “Tai-Kadai”, “Turkic”, “Uralic”;
  • Text is the text identifier (similar across languages and speakers) with 15 possible values “O1”, “O2”, “O3”, “O4”, “O6”, “O8”, “O9”, “P0”, “P1”, “P2”, “P3”, “P8”, “P9”, “Q0”, “Q1”;
  • Speaker is the unique speaker’s ID, with 170 possible values;
  • Sex is the speaker’s sex;
  • Duration is the text’s duration in seconds as spoken by the given speaker (pauses longer than 150 ms excluded);
  • NS is the text’s number of syllables (the same for a given text in a given language);
  • SR is the speech rate (syllables/second) for a given text spoken by a given speaker;
  • ShE (resp. ID) is the first-order (resp. second-order) language-level entropy estimate;
  • ShIR and IR are text × speaker properties obtained by multiplying SR by ShE and ID, respectively.
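To make the derived columns concrete, here is a minimal Python sketch with invented values (the analysis itself is in R; the relation SR = NS/Duration is our reading of the definitions above, not code from the script):

```python
# Hypothetical single row of InfoRateData.csv (all values invented).
NS = 100         # number of syllables in the text
Duration = 15.0  # spoken duration in seconds (pauses > 150 ms excluded)
ShE = 8.6        # first-order language-level entropy estimate (per syllable)
ID = 6.0         # second-order language-level entropy estimate (per syllable)

SR = NS / Duration  # speech rate in syllables/second (assumed relation)
ShIR = SR * ShE     # first-order information rate
IR = SR * ID        # second-order information rate

print(round(SR, 2), round(ShIR, 2), round(IR, 2))
```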

Exploratory plots and summaries

NS

NS: exploratory plots.

mean=101.217, median=100, sd=24.694, CV=0.244, min=49, max=162, kurtosis=2.485, skewness=0.178.

SR

SR: exploratory plots.

SR per speaker.

SR by language.

mean=6.612, median=6.76, sd=1.133, CV=0.171, min=3.59, max=9.38, kurtosis=2.382, skewness=-0.197.

ShE and ID

ShE and ID: exploratory plots.

ShE vs ID.


    Pearson's product-moment correlation

data:  tmp1$ShE and tmp1$ID
t = 2.0326, df = 15, p-value = 0.06019
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02052914  0.77274779
sample estimates:
      cor 
0.4647009 

    Spearman's rank correlation rho

data:  tmp1$ShE and tmp1$ID
S = 451.88, p-value = 0.07259
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4462208 

    Paired t-test

data:  tmp1$ShE and tmp1$ID
t = 11.635, df = 16, p-value = 3.213e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.158040 3.119607
sample estimates:
mean of the differences 
               2.638824 

ShE:

mean=8.619, median=8.69, sd=0.906, CV=0.105, min=6.07, max=9.83, kurtosis=4.662, skewness=-1.124.

ID:

mean=6.009, median=5.56, sd=0.883, CV=0.147, min=4.83, max=8.02, kurtosis=2.521, skewness=0.741.

ShIR and IR

ShIR: exploratory plots.

ShIR per speaker.

ShIR by language.

IR: exploratory plots.

IR per speaker.

IR by language.

ShIR:

mean=56.525, median=56.92, sd=9.217, CV=0.163, min=32.77, max=83.04, kurtosis=2.335, skewness=0.048.

IR:

mean=39.033, median=39.05, sd=4.945, CV=0.127, min=25.64, max=56.3, kurtosis=3.422, skewness=0.241.

SR and ID

SR vs ID.


    Pearson's product-moment correlation

data:  info.rate.data$SR and info.rate.data$ID
t = -46.635, df = 2263, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7204624 -0.6784227
sample estimates:
       cor 
-0.7000486 

    Spearman's rank correlation rho

data:  info.rate.data$SR and info.rate.data$ID
S = 3315700000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.7120735 

Linear Mixed Models (LMMs)

Linear ICCs

ICC for model NS ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.30
Language 0.54
Speaker 0.00

ICC for model SR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Language 0.62
Speaker 0.29

ICC for model ShIR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Language 0.56
Speaker 0.34

ICC for model IR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
Level-1 factor (f) ICC
Text 0.02
Language 0.36
Speaker 0.49
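Each ICC above is a variance-partition coefficient: the factor's random-intercept variance divided by the total (random-effects plus residual) variance. A minimal Python sketch with invented variance components (chosen only to roughly mirror the SR row; the actual values come from the fitted LMMs in R):

```python
# Invented variance components for a model of the form
# SR ~ 1 + (1 | Text) + (1 | Language) + (1 | Speaker).
var_text, var_language, var_speaker, var_resid = 0.01, 0.80, 0.37, 0.11

total = var_text + var_language + var_speaker + var_resid

# ICC of factor f = var_f / total variance
icc = {name: round(v / total, 2)
       for name, v in [("Text", var_text),
                       ("Language", var_language),
                       ("Speaker", var_speaker)]}
print(icc)
```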

SR

Comparing models

AIC and BIC for various hierarchical models of SR (minimum values in bold).
model AIC BIC
1 + (1 | Text) + (1 | Language) + (1 | Speaker) 1998.03 2026.65
1 + (1 | Language) + (1 | Speaker) 2269.3 2292.2
1 + (1 | Text) + (1 | Speaker) 2131.96 2154.86
1 + (1 | Text) + (1 | Language) 4490.99 4513.89
1 + Sex + (1 | Text) + (1 | Language) + (1 | Speaker) **1990.57** **2024.92**
1 + Sex + (1 | Language) + (1 | Speaker) 2261.69 2290.31
1 + Sex + (1 | Text) + (1 | Speaker) 2131.72 2160.35
1 + Sex + (1 | Text) + (1 | Language) 4364.38 4393.01

Residuals and random effects

We consider here the full model SR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).

IR

Comparing models

AIC and BIC for various hierarchical models of IR (minimum values in bold).
model AIC BIC
1 + (1 | Text) + (1 | Language) + (1 | Speaker) 10041.03 10069.65
1 + (1 | Language) + (1 | Speaker) 10307.63 10330.53
1 + (1 | Text) + (1 | Speaker) 10093.09 10115.99
1 + (1 | Text) + (1 | Language) 12574.43 12597.33
1 + Sex + (1 | Text) + (1 | Language) + (1 | Speaker) **10029.53** **10063.88**
1 + Sex + (1 | Language) + (1 | Speaker) 10295.98 10324.6
1 + Sex + (1 | Text) + (1 | Speaker) 10086.35 10114.97
1 + Sex + (1 | Text) + (1 | Language) 12440.04 12468.67

Residuals and random effects

We consider here the full model IR ~ 1 + Sex + (1|Text) + (1|Language) + (1|Speaker).

Generalized Additive Models for Location, Scale and Shape (GAMLSS)

We will use a Gaussian distribution (with fixed or modelled variance).
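To illustrate why modelling σ can pay off, here is a toy Python sketch (invented data and hypothetical groups standing in for a predictor like Sex; this is not the gamlss machinery): when one group is noisier, letting the standard deviation depend on the group can only improve the maximized log-likelihood, which is the logic behind the sigma.formula used below.

```python
import math
import random

# Invented data: the "M" group is noisier than the "F" group.
random.seed(0)
y = {"F": [random.gauss(6.4, 0.25) for _ in range(200)],
     "M": [random.gauss(6.8, 0.35) for _ in range(200)]}

def loglik(vals, mu, sd):
    # Gaussian log-likelihood of vals under N(mu, sd^2)
    return sum(-0.5 * math.log(2 * math.pi * sd ** 2)
               - (v - mu) ** 2 / (2 * sd ** 2) for v in vals)

def mle_mu_sd(vals):
    mu = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))  # MLE (1/n)
    return mu, sd

# Fixed sigma: group-specific means, one pooled standard deviation.
mus = {g: mle_mu_sd(vals)[0] for g, vals in y.items()}
resid = [v - mus[g] for g, vals in y.items() for v in vals]
sd_fixed = math.sqrt(sum(r ** 2 for r in resid) / len(resid))
ll_fixed = sum(loglik(vals, mus[g], sd_fixed) for g, vals in y.items())

# Modelled sigma: group-specific means AND standard deviations.
ll_model = sum(loglik(vals, *mle_mu_sd(vals)) for g, vals in y.items())
print(ll_fixed, ll_model)
```

Since the fixed-sigma model is nested in the modelled-sigma model, the second log-likelihood is never smaller; the question the deviance/AIC comparisons below answer is whether the gain justifies the extra parameters.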

SR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  6.666774e-05 
                       variance   =  1.000442 
               coef. of skewness  =  -0.0552563 
               coef. of kurtosis  =  3.323173 
Filliben correlation coefficient  =  0.9990793 
******************************************************************


Deviance= 1048.516 

AIC= 1447.603 
Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001738762 
                       variance   =  1.000438 
               coef. of skewness  =  -0.02112376 
               coef. of kurtosis  =  2.757837 
Filliben correlation coefficient  =  0.9991447 
******************************************************************


Deviance= 719.0835 

AIC= 1299.257 

The distribution of the residuals is less heteroscedastic than before, and the fit to the data is better. The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex + random(Text) + random(Language) +      random(Speaker), sigma.formula = ~1 + Sex + random(Text) +      random(Language) + random(Speaker), family = NO(mu.link = "identity"),  
    data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.453290   0.007618  847.14   <2e-16 ***
SexM        0.320159   0.011487   27.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.30948    0.02098 -62.417  < 2e-16 ***
SexM         0.11194    0.02972   3.767  0.00017 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2265 
Degrees of Freedom for the fit:  290.0869
      Residual Deg. of Freedom:  1974.913 
                      at cycle:  48 
 
Global Deviance:     719.0835 
            AIC:     1299.257 
            SBC:     2960.101 
******************************************************************

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.3977 
Random effect parameter sigma_b: 0.108314 
Smoothing parameter lambda     : 85.7824 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.9836 
Random effect parameter sigma_b: 0.87066 
Smoothing parameter lambda     : 1.32915 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.2778 
Random effect parameter sigma_b: 0.555308 
Smoothing parameter lambda     : 3.49815 
σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.583681 
Random effect parameter sigma_b: 0.0108169 
Smoothing parameter lambda     : 7523.57 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.91065 
Random effect parameter sigma_b: 0.161228 
Smoothing parameter lambda     : 34.0803 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 73.93344 
Random effect parameter sigma_b: 0.165342 
Smoothing parameter lambda     : 33.2784 

IR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.0002550467 
                       variance   =  1.000442 
               coef. of skewness  =  0.01302313 
               coef. of kurtosis  =  3.360878 
Filliben correlation coefficient  =  0.9990567 
******************************************************************


Deviance= 9108.279 

AIC= 9507.442 
Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001906696 
                       variance   =  1.000449 
               coef. of skewness  =  -0.02187398 
               coef. of kurtosis  =  2.755234 
Filliben correlation coefficient  =  0.9991259 
******************************************************************


Deviance= 8782.878 

AIC= 9364.641 

Again, this is a better fit to the data. The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex + random(Text) + random(Language) +      random(Speaker), sigma.formula = ~1 + Sex + random(Text) +      random(Language) + random(Speaker), family = NO(mu.link = "identity"),  
    data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 38.08916    0.04521  842.56   <2e-16 ***
SexM         1.89529    0.06851   27.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47207    0.02098  22.501  < 2e-16 ***
SexM         0.11385    0.02972   3.831 0.000131 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2265 
Degrees of Freedom for the fit:  290.8814
      Residual Deg. of Freedom:  1974.119 
                      at cycle:  7 
 
Global Deviance:     8782.878 
            AIC:     9364.641 
            SBC:     11030.03 
******************************************************************

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.42046 
Random effect parameter sigma_b: 0.658373 
Smoothing parameter lambda     : 2.32185 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.9518 
Random effect parameter sigma_b: 3.06988 
Smoothing parameter lambda     : 0.106913 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.1517 
Random effect parameter sigma_b: 3.31521 
Smoothing parameter lambda     : 0.0981439 
σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.09852989 
Random effect parameter sigma_b: 0.00428681 
Smoothing parameter lambda     : 47821.8 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.96383 
Random effect parameter sigma_b: 0.163535 
Smoothing parameter lambda     : 33.0774 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 75.29513 
Random effect parameter sigma_b: 0.168066 
Smoothing parameter lambda     : 32.181 

Modelling the relationship between SR and ID

Let’s model SR with ID as an additional predictor (fixed effect). N.B. In this case, we must drop Language as a random effect, since each language has, by definition, only one value of ID.

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID + Sex + random(Text) +      random(Speaker), sigma.formula = ~1 + ID + Sex +      random(Text) + random(Speaker), family = NO(mu.link = "identity"),      data = d, control = gamlss.control(n.cyc = 800,  
        trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.794174   0.036777  320.69   <2e-16 ***
ID          -0.888238   0.005842 -152.06   <2e-16 ***
SexM         0.310995   0.011382   27.32   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.79946    0.10475  -7.632 3.57e-14 ***
ID          -0.08642    0.01705  -5.070 4.35e-07 ***
SexM         0.10951    0.02972   3.685 0.000235 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2265 
Degrees of Freedom for the fit:  293.322
      Residual Deg. of Freedom:  1971.678 
                      at cycle:  18 
 
Global Deviance:     686.5481 
            AIC:     1273.192 
            SBC:     2952.558 
******************************************************************

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001881969 
                       variance   =  1.000438 
               coef. of skewness  =  -0.02304788 
               coef. of kurtosis  =  2.677419 
Filliben correlation coefficient  =  0.9989511 
******************************************************************


Deviance= 686.5481 

AIC= 1273.192 

Adding ID as a predictor improves the fit (as judged by AIC). The estimate for ID is negative, but its significance is difficult to assess in a GAMLSS model involving smoothing functions. However, a simple lmer model also shows a significant effect of ID:

Type III Analysis of Variance Table with Satterthwaite's method
     Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)    
ID  16.4136 16.4136     1 166.55 162.5728 < 2.2e-16 ***
Sex  0.8047  0.8047     1 166.63   7.9707  0.005334 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.42354 
Random effect parameter sigma_b: 0.10986 
Smoothing parameter lambda     : 83.3869 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 167.487 
Random effect parameter sigma_b: 0.74601 
Smoothing parameter lambda     : 1.94032 

σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 1.916828 
Random effect parameter sigma_b: 0.0201916 
Smoothing parameter lambda     : 2061.61 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 103.4947 
Random effect parameter sigma_b: 0.234219 
Smoothing parameter lambda     : 16.0415 

Mixture of Gaussians

In what follows, the mixing probabilities are independent of factors such as Sex.
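The gamlssMX fits below estimate, for each K, the mixing probabilities and the component means and sigmas via the EM algorithm. As background, a minimal univariate EM sketch for K = 2 in Python (invented data; this is only an illustration of the algorithm, not the gamlss implementation):

```python
import math
import random

# Invented bimodal sample: 30% from N(5.3, 0.65^2), 70% from N(7.2, 0.80^2).
random.seed(1)
data = ([random.gauss(5.3, 0.65) for _ in range(300)]
        + [random.gauss(7.2, 0.80) for _ in range(700)])

def norm_pdf(x, mu, sd):
    # Gaussian density
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Initialize the two components at the extremes of the data.
pi, mu, sd = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
for _ in range(200):
    # E-step: responsibility of each component for each point.
    resp = []
    for x in data:
        w = [pi[k] * norm_pdf(x, mu[k], sd[k]) for k in (0, 1)]
        s = sum(w)
        resp.append([wk / s for wk in w])
    # M-step: update mixing weights, means, and standard deviations.
    for k in (0, 1):
        nk = sum(r[k] for r in resp)
        pi[k] = nk / len(data)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sd[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                              for r, x in zip(resp, data)) / nk)

print([round(p, 2) for p in pi], [round(m, 2) for m in mu])
```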

SR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.612  
Sigma Coefficients for model: 1 
(Intercept)  
     0.1248  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2263 
Global Deviance:     6993.16 
            AIC:     6997.16 
            SBC:     7008.61 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      5.293  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.4553  
Mu Coefficients for model: 2 
(Intercept)  
      7.157  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.2298  

Estimated probabilities: 0.2925825 0.7074175 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2260 
Global Deviance:     6861.89 
            AIC:     6871.89 
            SBC:     6900.52 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
       5.23  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.4888  
Mu Coefficients for model: 2 
(Intercept)  
      6.612  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.1508  
Mu Coefficients for model: 3 
(Intercept)  
      7.271  
Sigma Coefficients for model: 3 
(Intercept)  
     -0.263  

Estimated probabilities: 0.2537915 0.2147315 0.531477 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2257 
Global Deviance:     6862.04 
            AIC:     6878.04 
            SBC:     6923.84 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.171  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.2973  
Mu Coefficients for model: 2 
(Intercept)  
      5.092  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.5485  
Mu Coefficients for model: 3 
(Intercept)  
      5.806  
Sigma Coefficients for model: 3 
(Intercept)  
     -0.454  
Mu Coefficients for model: 4 
(Intercept)  
      7.316  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.2168  

Estimated probabilities: 0.4443139 0.1849648 0.1507718 0.2199495 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2254 
Global Deviance:     6861.1 
            AIC:     6883.1 
            SBC:     6946.08 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.375  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.3485  
Mu Coefficients for model: 2 
(Intercept)  
      7.228  
Sigma Coefficients for model: 2 
(Intercept)  
     -1.001  
Mu Coefficients for model: 3 
(Intercept)  
      7.969  
Sigma Coefficients for model: 3 
(Intercept)  
    -0.5569  
Mu Coefficients for model: 4 
(Intercept)  
      6.248  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.4579  
Mu Coefficients for model: 5 
(Intercept)  
      5.041  
Sigma Coefficients for model: 5 
(Intercept)  
    -0.6049  

Estimated probabilities: 0.1788327 0.2320332 0.1993053 0.2001806 0.1896483 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2251 
Global Deviance:     6844.22 
            AIC:     6872.22 
            SBC:     6952.37 

Comparing AIC

            df      AIC
mix.SR.NO.2  5 6871.892
mix.SR.NO.5 14 6872.218
mix.SR.NO.3  8 6878.040
mix.SR.NO.4 11 6883.103
mix.SR.NO.1  2 6997.163

Showing the distributions

Mixture of Gaussians for SR.

IR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.03  
Sigma Coefficients for model: 1 
(Intercept)  
      1.598  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2263 
Global Deviance:     13667 
            AIC:     13671 
            SBC:     13682.5 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.76  
Sigma Coefficients for model: 1 
(Intercept)  
      1.766  
Mu Coefficients for model: 2 
(Intercept)  
      38.36  
Sigma Coefficients for model: 2 
(Intercept)  
      1.336  

Estimated probabilities: 0.4795038 0.5204962 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2260 
Global Deviance:     13637.8 
            AIC:     13647.8 
            SBC:     13676.4 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      35.42  
Sigma Coefficients for model: 1 
(Intercept)  
      1.339  
Mu Coefficients for model: 2 
(Intercept)  
      39.78  
Sigma Coefficients for model: 2 
(Intercept)  
       0.89  
Mu Coefficients for model: 3 
(Intercept)  
      42.22  
Sigma Coefficients for model: 3 
(Intercept)  
      1.636  

Estimated probabilities: 0.3631272 0.2946715 0.3422013 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2257 
Global Deviance:     13618.4 
            AIC:     13634.4 
            SBC:     13680.2 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      42.31  
Sigma Coefficients for model: 1 
(Intercept)  
      1.699  
Mu Coefficients for model: 2 
(Intercept)  
      39.43  
Sigma Coefficients for model: 2 
(Intercept)  
     0.6302  
Mu Coefficients for model: 3 
(Intercept)  
      40.07  
Sigma Coefficients for model: 3 
(Intercept)  
      1.316  
Mu Coefficients for model: 4 
(Intercept)  
      34.48  
Sigma Coefficients for model: 4 
(Intercept)  
       1.25  

Estimated probabilities: 0.2568272 0.1726828 0.3017301 0.2687599 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2254 
Global Deviance:     13613.6 
            AIC:     13635.6 
            SBC:     13698.6 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      42.85  
Sigma Coefficients for model: 1 
(Intercept)  
      1.694  
Mu Coefficients for model: 2 
(Intercept)  
      37.38  
Sigma Coefficients for model: 2 
(Intercept)  
      1.457  
Mu Coefficients for model: 3 
(Intercept)  
      39.58  
Sigma Coefficients for model: 3 
(Intercept)  
     0.4193  
Mu Coefficients for model: 4 
(Intercept)  
      40.57  
Sigma Coefficients for model: 4 
(Intercept)  
      1.308  
Mu Coefficients for model: 5 
(Intercept)  
      35.24  
Sigma Coefficients for model: 5 
(Intercept)  
      1.337  

Estimated probabilities: 0.2128878 0.2184185 0.1195341 0.2222382 0.2269214 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2251 
Global Deviance:     13612.4 
            AIC:     13640.4 
            SBC:     13720.5 

Comparing AIC

            df      AIC
mix.IR.NO.3  8 13634.37
mix.IR.NO.4 11 13635.64
mix.IR.NO.5 14 13640.36
mix.IR.NO.2  5 13647.78
mix.IR.NO.1  2 13671.04

Showing the distributions

Mixture of Gaussians for IR.

Tests of unimodality

We used three ways to estimate how unimodal a distribution is, as they tend to disagree and the problem of unimodality testing is far from settled (see Freeman & Dale, 2013):

  • the Silverman test, which tests the null hypothesis that the underlying density has at most k modes (H0: number of modes ≤ k). The result is the bootstrapped p-value for rejecting a unimodal distribution. It is described in Silverman (1981) and Hall & York (2001); our implementation is based on the code available at https://www.mathematik.uni-marburg.de/~stochastik/R_packages/;
  • the dip test, which computes Hartigans’ dip statistic D_n and its p-value for the test of unimodality by interpolating tabulated quantiles of sqrt(n) · D_n. For X_i ~ F, i.i.d., the null hypothesis is that F is a unimodal distribution. The result is the D statistic and the interpolated p-value for rejecting a unimodal distribution. See Hartigan (1985) and Hartigan & Hartigan (1985); it is implemented in the package diptest;
  • the bimodality coefficient (BC), which is based on an empirical relationship between bimodality and the third and fourth statistical moments of a distribution (skewness and kurtosis). It is proportional to the squared skewness divided by the uncorrected kurtosis, BC ≈ (s² + 1)/k, the underlying logic being that a bimodal distribution has very low kurtosis, an asymmetric character, or both; all of these conditions increase BC. The values range from 0 to 1, with those exceeding 0.555 (the value for a uniform distribution) suggesting bimodality. The result is the BC estimate (which must exceed 0.555 to reject a unimodal distribution). We implemented it following Freeman & Dale (2013) as BC = (s² + 1) / (k + 3(n-1)² / ((n-2)(n-3))).
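The BC formula can be sketched directly from its definition; a minimal Python illustration (our own, not the R code used in the analysis; the bias-corrected skewness and excess-kurtosis formulas follow the usual sample definitions):

```python
import math

def bimodality_coefficient(x):
    """Sample bimodality coefficient, following Freeman & Dale (2013):
    BC = (s^2 + 1) / (k + 3*(n-1)^2 / ((n-2)*(n-3))),
    where s is the bias-corrected sample skewness and k the
    bias-corrected sample excess kurtosis."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    g1 = m3 / m2 ** 1.5                 # raw skewness
    s = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    g2 = m4 / m2 ** 2 - 3               # raw excess kurtosis
    k = ((n - 1) / ((n - 2) * (n - 3))) * ((n + 1) * g2 + 6)
    return (s ** 2 + 1) / (k + 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# A clearly bimodal sample should exceed the 0.555 uniform benchmark.
print(bimodality_coefficient([0.0] * 50 + [1.0] * 50))
```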

For each such test, we performed four randomisation procedures to obtain an estimate of the “specialness” of the observed unimodality estimate; for each new permuted dataset, we recompute everything before estimating the unimodality of the permuted distribution:

  • Permutation model 1 (PM1): randomly permute the SR values freely among speakers, texts and languages;
  • Permutation model 2 (PM2): randomly permute the ID values among languages;
  • Permutation model 3 (PM3): randomly permute the Speaker average SR values among speakers (irrespective of languages);
  • Permutation model 4 (PM4): randomly permute the Language average SR values among languages.
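Schematically, a PM4-style permutation shuffles the language-level average SRs among languages while keeping within-language variation intact. A Python sketch with invented data (the real procedure operates on the full dataset in R and re-runs the three unimodality tests on each permutation):

```python
import random

# Invented per-language SR values (three observations per language).
random.seed(4)
data = {"FRA": [6.9, 7.1, 7.3], "DEU": [5.9, 6.0, 6.2], "JPN": [7.7, 7.9, 8.1]}

lang_means = {lang: sum(v) / len(v) for lang, v in data.items()}

# Randomly reassign the language means among languages.
langs = list(lang_means)
shuffled = random.sample(list(lang_means.values()), k=len(langs))
permuted_means = dict(zip(langs, shuffled))

# Shift each language's values so its mean matches the permuted mean,
# preserving within-language variation.
permuted = {lang: [v - lang_means[lang] + permuted_means[lang] for v in vals]
            for lang, vals in data.items()}
print(permuted)
```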

Visual comparison

The observed estimate (vertical blue solid line), the permuted distribution (gray histogram), and the “unimodality region” (shaded green rectangle) are shown below (for PM3, we also show the original estimate using the Speaker average SR as a vertical solid red line).

PM1

Permutation of the texts’ SRs (PM1).

PM2

Permutation of the languages’ ID (PM2).

PM3

Permutation of the speakers’ average SRs (PM3).

PM4

**Permutation of the languages' average SRs with speaker adjustment (PM4)**.

Summary of unimodality tests

Summary of unimodality permutation tests.

| Scenario | Measure | Test | Observed estimate (p-value) | % more unimodal permutations |
|----------|---------|------|-----------------------------|------------------------------|
| PM1 | SR | Silverman | - (0.019) | 74.2% |
| PM1 | SR | Dip | 0.006 (0.924) * | 100% |
| PM1 | SR | BC | 0.193 () * | 100% |
| PM1 | IR | Silverman | - (0.17) * | 86.8% |
| PM1 | IR | Dip | 0.005 (0.994) * | 50% |
| PM1 | IR | BC | 0.165 () * | 100% |
| PM2 | SR | Silverman | - (0.019) | 75.2% |
| PM2 | SR | Dip | 0.006 (0.924) * | 100% |
| PM2 | SR | BC | 0.193 () * | 100% |
| PM2 | IR | Silverman | - (0.17) * | 22.6% |
| PM2 | IR | Dip | 0.005 (0.994) * | 14.8% |
| PM2 | IR | BC | 0.165 () * | 97.7% |
| PM3 | SR | Silverman | - (0.019) | 100% |
| PM3 | SR | Dip | 0.006 (0.924) * | 0% |
| PM3 | SR | BC | 0.193 () * | 100% |
| PM3 | IR | Silverman | - (0.17) * | 86.2% |
| PM3 | IR | Dip | 0.005 (0.994) * | 6% |
| PM3 | IR | BC | 0.165 () * | 100% |
| PM4 | SR | Silverman | - (0.019) | 5.2% |
| PM4 | SR | Dip | 0.006 (0.924) * | 3.1% |
| PM4 | SR | BC | 0.193 () * | 75.5% |
| PM4 | IR | Silverman | - (0.17) * | 11.1% |
| PM4 | IR | Dip | 0.005 (0.994) * | 5.9% |
| PM4 | IR | BC | 0.165 () * | 97.3% |

Pair-wise distances between languages

We compute various distances between languages (as implemented by the function distance() in package philentropy) with respect to the distributions of NS, SR, and ID.
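As an illustration, one of these distances, the Jensen-Shannon divergence, can be computed in base R on two binned distributions (the analysis itself uses philentropy's distance(); the bin edges and toy samples here are ours):

```r
## Jensen-Shannon divergence between two binned (normalised) distributions.
## Illustrative base-R sketch of one of the distance measures used.
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))  # KL divergence
  (kl(p, m) + kl(q, m)) / 2
}

set.seed(7)
breaks <- seq(0, 20, by = 0.5)                # shared bin edges
p <- hist(rnorm(1000, 7, 1),   breaks = breaks, plot = FALSE)$counts
q <- hist(rnorm(1000, 9, 1.5), breaks = breaks, plot = FALSE)$counts
jsd(p / sum(p), q / sum(q))   # 0 for identical distributions, log(2) for disjoint ones
```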

Comparing the distribution of pairwise distances between languages.


Paired permutation t-tests comparing measures with 1,000 permutations.

| Measure 1 | Measure 2 | Distance | Mean (1) | Median (1) | SD (1) | Mean (2) | Median (2) | SD (2) | p |
|-----------|-----------|----------|----------|------------|--------|----------|------------|--------|---|
| IR | NS | Hellinger | 0.89 | 0.84 | 0.32 | 1.20 | 1.19 | 0.39 | 0.00 |
| IR | NS | Jensen-Shannon | 0.18 | 0.14 | 0.12 | 0.29 | 0.26 | 0.16 | 0.00 |
| IR | NS | Kolmogorov–Smirnov | 0.42 | 0.37 | 0.20 | 0.57 | 0.57 | 0.23 | 0.00 |
| IR | NS | Kullback-Leibler | 7.44 | 4.77 | 7.88 | 15.43 | 13.12 | 13.30 | 0.00 |
| IR | NS | Squared-Chi | 0.57 | 0.46 | 0.36 | 0.88 | 0.80 | 0.47 | 0.00 |
| IR | SR | Hellinger | 0.89 | 0.84 | 0.32 | 1.13 | 1.09 | 0.49 | 0.00 |
| IR | SR | Jensen-Shannon | 0.18 | 0.14 | 0.12 | 0.28 | 0.24 | 0.20 | 0.00 |
| IR | SR | Kolmogorov–Smirnov | 0.42 | 0.37 | 0.20 | 0.57 | 0.56 | 0.27 | 0.00 |
| IR | SR | Kullback-Leibler | 7.44 | 4.77 | 7.88 | 15.90 | 10.08 | 16.29 | 0.00 |
| IR | SR | Squared-Chi | 0.57 | 0.46 | 0.36 | 0.88 | 0.78 | 0.58 | 0.00 |
| NS | SR | Hellinger | 1.20 | 1.19 | 0.39 | 1.13 | 1.09 | 0.49 | 0.08 |
| NS | SR | Jensen-Shannon | 0.29 | 0.26 | 0.16 | 0.28 | 0.24 | 0.20 | 0.65 |
| NS | SR | Kolmogorov–Smirnov | 0.57 | 0.57 | 0.23 | 0.57 | 0.56 | 0.27 | 0.99 |
| NS | SR | Kullback-Leibler | 15.43 | 13.12 | 13.30 | 15.90 | 10.08 | 16.29 | 0.75 |
| NS | SR | Squared-Chi | 0.88 | 0.80 | 0.47 | 0.88 | 0.78 | 0.58 | 0.93 |
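A paired permutation t-test of the kind summarised above can be sketched in base R using random sign flips of the paired differences (the simulated distance vectors are illustrative; 17 languages yield 17 • 16 / 2 = 136 unordered pairs):

```r
## Paired permutation t-test on two paired vectors of pairwise distances.
## Under the null, the sign of each paired difference is exchangeable, so
## the null distribution is built by randomly flipping signs.
paired_perm_t <- function(x, y, n_perm = 1000) {
  d <- x - y
  t_stat <- function(v) mean(v) / (sd(v) / sqrt(length(v)))  # paired t statistic
  obs  <- t_stat(d)
  perm <- replicate(n_perm,
                    t_stat(d * sample(c(-1, 1), length(d), replace = TRUE)))
  mean(abs(perm) >= abs(obs))   # two-sided permutation p-value
}

set.seed(3)
x <- rnorm(136, mean = 0.89, sd = 0.32)     # e.g. simulated IR Hellinger distances
y <- x + rnorm(136, mean = 0.31, sd = 0.20) # e.g. simulated NS Hellinger distances
paired_perm_t(x, y)   # a clear paired shift gives a p-value near 0
```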

References

Campione, E., & Véronis, J. (1998). A multilingual prosodic database. Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP’98), Sydney, Australia, 3163-3166.

Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1), 83-97.

Hall, P., & York, M. (2001). On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11, 515-536.

Hartigan, J. A., & Hartigan, P. M. (1985). The Dip Test of Unimodality. Annals of Statistics, 13, 70–84.

Hartigan, P. M. (1985). Computation of the Dip Statistic to Test for Unimodality. Applied Statistics (JRSS C), 34, 320–325.

Le, V. B., Tran, D. D., Castelli, E., Besacier, L., & Serignat, J. F. (2004). Spoken and Written Language Resources for Vietnamese. In LREC, 4, 599-602.

Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H., Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9). Association for Computational Linguistics, Gothenburg, Sweden, 36-43.

New B., Pallier C., Ferrand L., & Matos R. (2001). Une base de données lexicales du français contemporain sur internet: LEXIQUE 3.80, L’Année Psychologique, 101, 447-462. http://www.lexique.org.

Oh, Y. M. (2015). Linguistic complexity and information: quantitative approaches. PhD Thesis, Université de Lyon, France. Retrieved from http://www.afcp-parole.org/doc/theses/these_YMO15.pdf

Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610-615.

Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (Eds.) WaCky! Working papers on the web as corpus, Gedit, Bologna, http://corpus.leeds.ac.uk/queryzh.html.

Silverman, B.W. (1981). Using Kernel Density Estimates to investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43, 97-99.

Váradi, T. (2002). The Hungarian National Corpus. In LREC.

Zséder, A., Recski, G., Varga, D., & Kornai, A. (2012). Rapid creation of large-scale corpora and frequency dictionaries. In Proceedings to LREC 2012.

Appendix I: R session info

This document was compiled on:

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: grid, compiler, parallel, splines, stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: broman(v.0.68-2), philentropy(v.0.3.0), diptest(v.0.75-7), pander(v.0.6.3), moments(v.0.14), sjPlot(v.2.6.2), sjstats(v.0.17.3), gamlss.mx(v.4.3-5), nnet(v.7.3-12), gamlss(v.5.1-2), nlme(v.3.1-137), gamlss.dist(v.5.1-1), MASS(v.7.3-51.1), gamlss.data(v.5.1-0), lmerTest(v.3.1-0), lme4(v.1.1-20), Matrix(v.1.2-15), plyr(v.1.8.4), reshape2(v.1.4.3), ggplot2(v.3.1.0) and RhpcBLASctl(v.0.18-205)

loaded via a namespace (and not attached): numDeriv(v.2016.8-1), tools(v.3.4.4), TMB(v.1.7.15), backports(v.1.1.3), R6(v.2.4.0), sjlabelled(v.1.0.16), lazyeval(v.0.2.1), colorspace(v.1.4-0), withr(v.2.1.2), tidyselect(v.0.2.5), mnormt(v.1.5-5), emmeans(v.1.3.2), sandwich(v.2.5-0), labeling(v.0.3), scales(v.1.0.0), mvtnorm(v.1.0-8), psych(v.1.8.12), ggridges(v.0.5.1), stringr(v.1.4.0), digest(v.0.6.18), foreign(v.0.8-71), minqa(v.1.2.4), rmarkdown(v.1.11), stringdist(v.0.9.5.1), pkgconfig(v.2.0.2), htmltools(v.0.3.6), highr(v.0.7), pwr(v.1.2-2), rlang(v.0.3.1), generics(v.0.0.2), zoo(v.1.8-4), dplyr(v.0.8.0.1), magrittr(v.1.5), modeltools(v.0.2-22), bayesplot(v.1.6.0), Rcpp(v.1.0.0), munsell(v.0.5.0), prediction(v.0.3.6.2), stringi(v.1.3.1), multcomp(v.1.4-8), yaml(v.2.2.0), snakecase(v.0.9.2), sjmisc(v.2.7.7), forcats(v.0.4.0), crayon(v.1.3.4), lattice(v.0.20-38), ggeffects(v.0.8.0), haven(v.2.1.0), hms(v.0.4.2), knitr(v.1.21), pillar(v.1.3.1), estimability(v.1.3), codetools(v.0.2-16), stats4(v.3.4.4), glue(v.1.3.0), evaluate(v.0.13), data.table(v.1.12.0), modelr(v.0.1.4), nloptr(v.1.2.1), gtable(v.0.2.0), purrr(v.0.3.0), tidyr(v.0.8.2), assertthat(v.0.2.0), xfun(v.0.5), coin(v.1.2-2), xtable(v.1.8-3), broom(v.0.5.1), coda(v.0.19-2), survival(v.2.43-3), tibble(v.2.0.1), glmmTMB(v.0.2.3) and TH.data(v.1.0-10)

Appendix II: Figures for the main paper

Here we generate the figures used in the main paper (saved to the ./figures folder as 600 DPI TIFF files Figure-*.tiff).